Introduction

Figure caption: Main causes of deforestation in central Menabe. a-a’: Slash-and-burn agriculture (“hatsake”) for peanut crop. Peanut (a’) is cultivated as a cash crop. Part of the production is at the destination of the national market but most of it is exported outside Madagascar, mainly for the Chinese market. b-b’: Slash-and-burn agriculture for maize crop. maize (b’) is cultivated for auto-consumption and as a cash crop. The production of maize is at the destination of the national market and is used in particular to brew the national beers. c-c’: Cyclone followed by uncontrolled fires. Cyclone “Fanele” (2009) caused tree mortality and accumulation of wood fuel on the ground. As a consequence, uncontrolled fires set on nearby pastures (c’) spread over large areas of forest after 2009. d-d’: Illegal logging. Timbers are used for house and boat construction.

This tutorial is licensed under the Creative Commons Attribution-ShareAlike 4.0 International License.

Get the tutorial

In RStudio, open a terminal and clone the GitHub repository locally:
git clone https://github.com/ghislainv/forecasting-deforestation-Mada

Navigate with RStudio in the folder named forecasting-deforestation-Mada and open the R Notebook file named training.Rmd. R Markdown files (*.Rmd) can be used in R to embbed text, code, table and figues inside a unique document. Code are organized into ‘chunks’ that can be executed independently and interactively. An R Notebook is an R Markdown document with output visible immediately beneath the input.

I used this notebook to write the tutorial available at https://forestatrisk.cirad.fr/tutorial.

In the preambule, change the author and date with your name and today’s date.

Computing environment

Python in R

Specify the Python version to use with reticulate and check that it is the right one.

## python:         /home/ghislain/miniconda3/envs/forestatrisk/bin/python
## libpython:      /home/ghislain/miniconda3/envs/forestatrisk/lib/libpython3.7m.so
## pythonhome:     /home/ghislain/miniconda3/envs/forestatrisk:/home/ghislain/miniconda3/envs/forestatrisk
## version:        3.7.6 (default, Jan  8 2020, 19:59:22)  [GCC 7.3.0]
## numpy:          /home/ghislain/miniconda3/envs/forestatrisk/lib/python3.7/site-packages/numpy
## numpy_version:  1.17.4
## 
## python versions found: 
##  /home/ghislain/miniconda3/envs/forestatrisk/bin/python
##  /usr/bin/python3
##  /usr/bin/python
##  /home/ghislain/miniconda3/bin/python

Import the Python modules into R.

R/Python intro

R

Some exercises
## [1] 21  6  7  3 10 13
Simple regression

## 
## Call:
## lm(formula = y.seq ~ x.seq, data = Data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -39.232 -12.821   0.727  13.166  46.325 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.33222    4.00372  -0.083    0.934    
## x.seq        2.00548    0.06927  28.952   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 18.5 on 98 degrees of freedom
## Multiple R-squared:  0.8953, Adjusted R-squared:  0.8943 
## F-statistic: 838.2 on 1 and 98 DF,  p-value: < 2.2e-16
## (Intercept)       x.seq 
##  -0.3322161   2.0054801
## [1] 338.867
## [1] 98

Python

Some exercises

You can execute Python code within a R notebooks. We need to indicate which version of Python to use in the notebook for Python chunk.

## <class 'numpy.ndarray'>
## [21  6  7  3 10 13]

Preparing the data-set

Downloading the geospatial data

To model the spatial probability of deforestation, we need a map of the past deforestation and maps of potential explicative environmental variables. Environmental variables can be derived from topography (altitude and slope), accessibility (distance to roads, towns, and forest edge), deforestation history (distance to previous deforestation) or landscape policy (protected area network) for example. In our example, we use the following variables :

Product Source Variable derived Unit Resolution (m)
Deforestation maps (1990-2000-2010) BioSceneMada (1) forest/non-forest 30
distance to forest edge m 30
distance to previous deforestation m 30
Digital Elevation Model SRTM v4.1 CSI-CGIAR (2) altitude m 90
slope ° 90
Highways OSM - Geofabrik (3) distance to roads m 150
Places distance to towns m 150
Waterways distance to river m 150
Protected areas Rebioma (4) presence of protected area 30
  1. http://bioscenemada.cirad.fr, deforestation maps from Vieilledent et al. (2018)
  2. http://srtm.csi.cgiar.org, SRTM 90m Digital Elevation Database v4.1
  3. http://www.geofabrik.de, data extracts from the OpenStreetMap project for Madagascar,
  4. http://rebioma.net, SAPM (“Système des Aires Protégées à Madagascar”), 20/12/2010 version

In our example, fordefor.tif is a forest raster at 30m for the year 2010 considering the deforestation on the period 2000-2010 in Madagascar. We can plot this raster and zoom on a region with function .plot.forest() in the forestatrisk package to see the deforestation data. The remaining forest appears in green and the deforested areas appear in red.

Sampling points

We use the function .sample() from the forestatrisk module to sample 10,000 points (pixel centers) in the deforested areas and in the remaining forest (20,000 points in total). The input_forest_raster argument defines the path to the forest raster including the deforested pixels (with value=0) and the remaining forest pixels (value=1) after a given period of time. The random seed can be set with argument Seed to reproduce the data of the random sampling.

The .sample() function also extracts information from environmental variables for each sampled point. Sampling is done by block to allow the computation on large study areas (e.g. country or continental scale) with a fine spatial resolution (e.g. 30m). The var_dir argument indicates the directory including all the environmental raster files (they must be GeoTIFF files with extension .tif) that we want to test.

The .sample() function identifies the spatial cell for each sample point (sample point are grouped within a spatial cell). Spatial cells and grouped observations are used to estimate the spatial autocorrelation of the deforestation process. The csize argument define the width (in km) of the square spatial cells. Each spatial cell will be attributed a parameter. To avoid estimating too many parameters, width of the square spatial cells must be sufficiently large. Both points sampling and extraction of environmental data are done by block to avoid memory problems for big datasets.

## Help on function sample in module forestatrisk.sample:
## 
## sample(nsamp=10000, Seed=1234, csize=10, var_dir='data', input_forest_raster='forest.tif', output_file='output/sample.txt', blk_rows=0)
##     Sample points and extract raster values.
##     
##     This function (i) randomly draw spatial points in deforested and
##     forested areas and (ii) extract environmental variable values for
##     each spatial point.
##     
##     :param nsamp: number of random spatial points.
##     :param seed: seed for random number generator.
##     :param csize: spatial cell size in km.
##     :param var_dir: directory with raster data.
##     :param input_forest_raster: name of the forest raster file (1=forest,     0=deforested) in the var_dir directory
##     :param output_file: path to file to save sample points.
##     :param blk_rows: if > 0, number of lines per block.
##     :return: a pandas DataFrame, each row being one observation.
##   altitude dist_defor dist_edge dist_river dist_road dist_town fordefor sapm
## 1       62        258        60       4460      5972      8580        0    0
## 2      253        703       703       2970      2754      4621        0    1
## 3      116       2145       268      14829     14086     14335        0    0
## 4      164       1834        30       3189      4053      4053        0    1
## 5      203        268       170       6244     12399     13064        0    0
## 6      112       7778       324       7270      8396      6941        0    0
##   slope       X       Y  cell
## 1     4  774525 7451835 10010
## 2     7  407595 7402695 10297
## 3     1  344865 7499415  9562
## 4     2  403395 7395705 10378
## 5     1  449445 7283475 11274
## 6    10 1043835 8463045  1775

Sampled observations can be plotted using function .plot.obs() from the forestatrisk module. Dark red dots indicate deforestation observations and dark green dots indicate forest observations.

Descriptive statistics

Before modelling the deforestation, it is important to look at the relationship between environmental variables and deforestation. Using formulas from the patsy Python module, we can specify the relationships that we want to look at. In the example below, we plot the relationships between some continuous environmental variables and the probability of deforestation using function .plot.correlation() from the forestatrisk package. Note that -1 must be set at the end of the formula. The function .correlation() returns a serie of graphics that can be analyzed to choose the right relationship for each continuous variable (linear or polynomial for example).

In this example (see pdf files produced), we can see that a linear model should be sufficient to represent the relationship between the probability of deforestation and the standardized distance to the nearest road or town. On the contrary, it might be interesting to fit a polynomial model for the the standardized distance to previous deforestation (dist_defor variable) for which the relationship seems non-linear. Several models can be fitted and compared to see if a second-order or third-order polynomial relationship is justified.

Deforestation model

We propose to use the Binomial icar model (Vieilledent et al. 2014) to estimate the deforestation probability of a pixel given a set of environmental variables. The Binomial icar model is a linear Binomial logistic regression model including an intrinsic Conditional Autoregressive (icar) process to account for the spatial autocorrelation of the observations. Parameter inference is done in a hierarchical Bayesian framework. The .model_binomial_icar() function from the forestatrisk module includes a Metropolis-within-Gibbs algorithm written in pure C code to reduce computation time.

Figure caption: Parameter inference is done in a hierarchical Bayesian framework. Bayesian statistics rely on the Bayes’ theorem named after Reverend Thomas Bayes. Each parameter has a prior and an approximated posterior probability distribution from which we can compute the mean, standard deviation, credible intervals at 95%, etc.

For the deforestation process it is very important to take into account the spatial autocorrelation of the process with spatial random effects. Indeed, the selected fixed environmental variables are not able to fully explain the spatial variability of the deforestation process, especially when working at large geographical scales, such as the national or continental scale. Spatial random effects allow estimating a higher/lower probability of deforestation in a particular region (associated to unmeasurable or unknow factors) that is different from the mean probability of deforestation derived from the environmental factors included in the model. The Binomial icar model can be described as follow:

Ecological process

\[\begin{equation} y_i \sim \mathcal{B}inomial(\theta_i,t_i) \\ \text{logit}(\theta_i) = X_i \beta + \rho_{j(i)} \end{equation}\]

\(y_i\): random variable for the deforestation process (0 if no deforestation, 1 if deforestation)

\(\theta_i\): probability of deforestation

\(t_i\): number of trials (always 1 in our example)

\(X_i\): vector of values for environmental explicative variables

\(\beta\): vector of fixed effect parameters

\(\rho_j\): spatial random effect

\(j(i)\): index of the spatial entity for observation \(i\).

Spatial autocorrelation

The spatial autocorrelation is modelled with an intrinsic conditional autoregressive (icar) process:

\[\begin{equation} \rho_j \sim \mathcal{N}ormal(\mu_j,V_{\rho} / n_j) \end{equation}\]

\(\mu_j\): mean of \(\rho_{j'}\) in the neighborhood of \(j\).

\(V_{\rho}\): variance of the spatial random effects.

\(n_j\): number of neighbors for spatial entity \(j\).

Figure caption: Representation of the neighborhood for the intrinsic conditional autoregressive (icar) process. Target spatial cell \(j\) has 8 neighbors in this case. Several observations (black points, equivalent to pixel centers in our case) can be located in each spatial cell. Deforestation probability in one spatial cell \(j\) depends on deforestation probability in neighboring cells.

Model formula

A model formula must also be defined to specify the explicative variables we want to include in the model. The formula allows specifying some variable transformations (such as standardization in our case). See the patsy module for more information. In our model, we included the following variables: location inside a protected area, altitude, distance to past deforestation (with a degree two polynomial), distance to forest edge, distance to nearest road and distance to nearest town. The formula must end with the name of the variable indicating the spatial cell for each observation (cell in our case).

Model summary

Once the model has been fitted, we can print a summary of the model showing the parameter estimates. The 95% credible intervals obtained from the posterior distribution of each parameter, except distance to nearest town (dist_town), do not include zero, indicating that parameters are significantly different from zero. The variance of the spatial random effects (Vrho) is given together with the deviance value, which can be used to compare different statistical models (lower deviance is better). Looking at the parameter estimates, we can see that the deforestation probability is much lower inside protected areas and that deforestation probability decreases with altitude, slope, distance to past deforestation, forest edge, roads and towns. Parameter values are then coherent regarding the deforestation process and easy to interpret.

## Binomial logistic regression with iCAR process
##   Model: I(1 - fordefor) + trials ~ 1 + C(sapm) + scale(altitude) + scale(slope) + scale(dist_defor) + scale(dist_edge) + scale(dist_road) + scale(dist_town) + cell
##   Posteriors:
##                         Mean        Std     CI_low    CI_high
##         Intercept     -0.297     0.0649     -0.417      -0.15
##    C(sapm)[T.1.0]     -0.445     0.0914     -0.634     -0.285
##   scale(altitude)     -0.516     0.0713     -0.685     -0.392
##      scale(slope)     -0.147     0.0371     -0.221    -0.0774
## scale(dist_defor)     -0.432      0.045     -0.513     -0.345
##  scale(dist_edge)     -0.667     0.0509     -0.765     -0.566
##  scale(dist_road)    -0.0449     0.0506     -0.139     0.0539
##  scale(dist_town)      -0.15     0.0553     -0.248    -0.0382
##              Vrho       7.28      0.506        6.4       8.33
##          Deviance   9.84e+03       58.6   9.73e+03   9.96e+03
## Binomial logistic regression with iCAR process
##   Model: I(1 - fordefor) + trials ~ 1 + C(sapm) + scale(altitude) + scale(slope) + scale(dist_defor) + scale(dist_edge) + scale(dist_road) + scale(dist_town) + cell
##   Posteriors:
##                         Mean        Std     CI_low    CI_high
##         Intercept     -0.297     0.0649     -0.417      -0.15
##    C(sapm)[T.1.0]     -0.445     0.0914     -0.634     -0.285
##   scale(altitude)     -0.516     0.0713     -0.685     -0.392
##      scale(slope)     -0.147     0.0371     -0.221    -0.0774
## scale(dist_defor)     -0.432      0.045     -0.513     -0.345
##  scale(dist_edge)     -0.667     0.0509     -0.765     -0.566
##  scale(dist_road)    -0.0449     0.0506     -0.139     0.0539
##  scale(dist_town)      -0.15     0.0553     -0.248    -0.0382
##              Vrho       7.28      0.506        6.4       8.33
##          Deviance   9.84e+03       58.6   9.73e+03   9.96e+03

Plot traces and posteriors

To check for the convergence of the Markov chain Monte Carlo (MCMC), we can plot the traces and the posterior distributions of the estimated parameters using method .plot() associated to the hSDM_binomial_icar class defined in the deforestprob module. This method returns the figures showing the traces and posterior distributions.

Forecasting deforestation

Resampling the spatial random effects

We use the model to predict the spatial probability of deforestation at the national scale for Madagascar. Before, doing so, we smooth the spatial random effects which have been estimated at a coarse resolution (10km in our example). To do so, we use the function .resample_rho() from the deforestprob module to resample the results at a finer resolution using a bilinear interpolation. The function writes a raster file to the disk with a resolution of the raster specified in the argument input_raster of the function (1km in our case). The function .resample_rho() returns a figure of the spatial random effects that can be plotted.

## Results plotted to "output/rho_ggplot.png"

Spatial probability of deforestation in 2010

The .predict_raster_binomial_icar() function of the forestatrisk module can be used to predict the spatial probability of deforestation from an hSDM_binomial_icar model (i.e. an object of class hSDM_binomial_icar). The function writes a raster of predictions to the disk. The prediction is done by block to avoid memory problems for big datasets. Functions will return NA for pixels with no forest or for pixels with missing environmental variables.

The raster of predictions can then be plotted.

Predicting future forest cover

Given the spatial probability of deforestation and the number of hectares that should be deforested, we can predict the future forest cover using function .deforest() from the forestatrisk package. The number of hectares are converted into number of pixels to be deforested. Pixels with the highest probability of deforestation are deforested first. The function computes a probability threshold above which pixels are deforested.

In our example, we consider an annual deforestation of roughly 100,000 ha for Madagascar. Considering the period 2010-2050, this would correspond to 4 Mha of deforestation. This number can of course be more precise and refined considering various deforestation scenarios (demographic growth, economic development, etc.).

We can plot the predicted future forest cover in 2050 with leaflet. We first reproject the raster to WGS 84 / World Mercator projection (EPSG:3857) and we resample the map at 250 m using function gdalwarp from GDAL. The 250 m resolution raster that can be transformed into a lower size image that can be plotted with leaflet.

We also set the color of the map and the extent of the view for leaflet.

The red areas represent the deforestation on the period 2010-2050. The green areas represent the remaining forest in 2050. Most of the remaining forest in 2050 are inside the protected areas or located in remote areas, at high altitudes and far from roads and big cities (for example in the Tsaratanana mountain region and around the Masoala peninsula, north-east Madagascar).

Model comparison

Model types

  • null model
  • glm model: simple glm with no spatial random effect model
##                   Generalized Linear Model Regression Results                  
## ===============================================================================
## Dep. Variable:         I(1 - fordefor)   No. Observations:                10000
## Model:                             GLM   Df Residuals:                     9999
## Model Family:                 Binomial   Df Model:                            0
## Link Function:                   logit   Scale:                          1.0000
## Method:                           IRLS   Log-Likelihood:                -6931.0
## Date:              dim., 12 janv. 2020   Deviance:                       13862.
## Time:                         18:00:40   Pearson chi2:                 1.00e+04
## No. Iterations:                      3                                         
## Covariance Type:             nonrobust                                         
## ==============================================================================
##                  coef    std err          z      P>|z|      [0.025      0.975]
## ------------------------------------------------------------------------------
## Intercept     -0.0200      0.020     -1.000      0.317      -0.059       0.019
## ==============================================================================
##                   Generalized Linear Model Regression Results                  
## ===============================================================================
## Dep. Variable:         I(1 - fordefor)   No. Observations:                10000
## Model:                             GLM   Df Residuals:                     9992
## Model Family:                 Binomial   Df Model:                            7
## Link Function:                   logit   Scale:                          1.0000
## Method:                           IRLS   Log-Likelihood:                -6421.4
## Date:              dim., 12 janv. 2020   Deviance:                       12843.
## Time:                         18:00:40   Pearson chi2:                 1.13e+04
## No. Iterations:                      5                                         
## Covariance Type:             nonrobust                                         
## =====================================================================================
##                         coef    std err          z      P>|z|      [0.025      0.975]
## -------------------------------------------------------------------------------------
## Intercept             0.0334      0.027      1.225      0.221      -0.020       0.087
## C(sapm)[T.1.0]       -0.3073      0.048     -6.438      0.000      -0.401      -0.214
## scale(altitude)      -0.1798      0.026     -6.916      0.000      -0.231      -0.129
## scale(slope)         -0.1590      0.025     -6.243      0.000      -0.209      -0.109
## scale(dist_defor)    -0.3606      0.030    -11.945      0.000      -0.420      -0.301
## scale(dist_edge)     -0.4841      0.035    -13.988      0.000      -0.552      -0.416
## scale(dist_road)      0.0780      0.023      3.385      0.001       0.033       0.123
## scale(dist_town)     -0.1405      0.023     -6.222      0.000      -0.185      -0.096
## =====================================================================================
##                   Generalized Linear Model Regression Results                  
## ===============================================================================
## Dep. Variable:         I(1 - fordefor)   No. Observations:                10000
## Model:                             GLM   Df Residuals:                     9992
## Model Family:                 Binomial   Df Model:                            7
## Link Function:                   logit   Scale:                          1.0000
## Method:                           IRLS   Log-Likelihood:                -6421.4
## Date:              dim., 12 janv. 2020   Deviance:                       12843.
## Time:                         18:00:40   Pearson chi2:                 1.13e+04
## No. Iterations:                      5                                         
## Covariance Type:             nonrobust                                         
## =====================================================================================
##                         coef    std err          z      P>|z|      [0.025      0.975]
## -------------------------------------------------------------------------------------
## Intercept             0.0334      0.027      1.225      0.221      -0.020       0.087
## C(sapm)[T.1.0]       -0.3073      0.048     -6.438      0.000      -0.401      -0.214
## scale(altitude)      -0.1798      0.026     -6.916      0.000      -0.231      -0.129
## scale(slope)         -0.1590      0.025     -6.243      0.000      -0.209      -0.109
## scale(dist_defor)    -0.3606      0.030    -11.945      0.000      -0.420      -0.301
## scale(dist_edge)     -0.4841      0.035    -13.988      0.000      -0.552      -0.416
## scale(dist_road)      0.0780      0.023      3.385      0.001       0.033       0.123
## scale(dist_town)     -0.1405      0.023     -6.222      0.000      -0.185      -0.096
## =====================================================================================

Spatial cross-validation

Using the validation data-set, we can compute several model performance indices (Liu, White, and Newell 2011):

  • FOM: figure of merit
  • OA: overall accuracy
  • EA: expected accuracy
  • Spe: specificity
  • Sen: sensitivity
  • TSS: true skill statistics
  • K: kappa of Cohen
  • AUC: area under the (ROC) curve

It is known that the value of some of these indices might vary with the intensity of deforestation (Pontius et al. 2008).

In our case, we compute these indices for various values of the deforestation rate (in percentage of deforestation observations) in our validation data-set (1, 5, 10, 25, 50). For each percentage, the threshold used to convert the predicted probabilities of deforestation were choosen in order to respect the deforestation rate (in %) in the observations.

##        mod  FOM   OA  EA  Spe  Sen  TSS    K  AUC
## 1     null 0,33 0,50 0,5 0,50 0,50 0,00 0,00 0,50
## 2 nochange 0,00 0,50 0,5 1,00 0,00 0,00 0,00 0,50
## 3      glm 0,45 0,62 0,5 0,62 0,62 0,23 0,23 0,66
## 4     icar 0,58 0,73 0,5 0,73 0,73 0,47 0,47 0,81

Comparing predictions on the long term

References

Liu, Canran, Matt White, and Graeme Newell. 2011. “Measuring and Comparing the Accuracy of Species Distribution Models with Presence-Absence Data.” Ecography 34 (2): 232–43. https://doi.org/10.1111/j.1600-0587.2010.06354.x.

Pontius, Robert, Wideke Boersma, Jean-Christophe Castella, Keith Clarke, Ton de Nijs, Charles Dietzel, Zengqiang Duan, et al. 2008. “Comparing the Input, Output, and Validation Maps for Several Models of Land Change.” The Annals of Regional Science 42 (1): 11–37. https://doi.org/10.1007/s00168-007-0138-2.

Vieilledent, Ghislain, Clovis Grinand, Fety A. Rakotomalala, Rija Ranaivosoa, Jean-Roger Rakotoarijaona, Thomas F. Allnutt, and Frédéric Achard. 2018. “Combining Global Tree Cover Loss Data with Historical National Forest Cover Maps to Look at Six Decades of Deforestation and Forest Fragmentation in Madagascar.” Biological Conservation 222: 189–97. https://doi.org/10.1016/j.biocon.2018.04.008.

Vieilledent, Ghislain, Cory Merow, Jérôme Guélat, Andrew M. Latimer, Marc Kéry, Alan E. Gelfand, Adam M. Wilson, Frédéric Mortier, and John A. Silander Jr. 2014. hSDM: hierarchical Bayesian species distribution models. https://doi.org/10.5281/zenodo.594920.